@mveroone mveroone commented Sep 27, 2025

Which problem is this PR solving?

The current implementation of the Event Loop Utilization passes a delta value to the call instead of an absolute one.
The Node.js perf_hooks documentation is a little ambiguous, but it does say that when eventLoopUtilization() is called with one argument, that argument should be the result of a previous call to the same function without arguments:

utilization1 <Object> The result of a previous call to eventLoopUtilization().

(Emphasis is mine)

The result of this bug is that the value tends to stabilize over time: because we pass a diff of a diff of a diff, we end up returning roughly the utilization since the start of the process instead of the delta since the last execution.
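
To make the failure mode concrete, here is a minimal sketch that models the one-argument call as a plain subtraction over made-up cumulative counters. The `Elu` shape mirrors the `idle` / `active` / `utilization` fields returned by eventLoopUtilization(); the sample readings and the names `reading`, `since`, `buggySeries` and `fixedSeries` are hypothetical and are not code from this PR.

```typescript
// Shape of the object returned by eventLoopUtilization() (times in ms).
interface Elu {
  idle: number;
  active: number;
  utilization: number;
}

// Build a reading from raw counters (hypothetical helper).
const reading = (idle: number, active: number): Elu => ({
  idle,
  active,
  utilization: idle + active === 0 ? 0 : active / (idle + active),
});

// Models eventLoopUtilization(previous): "now" minus `previous`.
const since = (now: Elu, previous?: Elu): Elu =>
  previous ? reading(now.idle - previous.idle, now.active - previous.active) : now;

// Made-up cumulative counters, sampled once per second.
// The third interval is fully busy; the others are 10% busy.
const samples: Elu[] = [
  reading(900, 100),
  reading(1800, 200),
  reading(1800, 1200),
  reading(2700, 1300),
];

// Buggy: the stored value is the previous *delta*.
function buggySeries(input: Elu[]): number[] {
  let last: Elu | undefined;
  return input.map(now => {
    const elu = since(now, last);
    last = elu; // a delta is fed back in as if it were an absolute reading
    return elu.utilization;
  });
}

// Fixed: the stored value is the previous *absolute* reading.
function fixedSeries(input: Elu[]): number[] {
  let last: Elu | undefined;
  return input.map(now => {
    const elu = since(now, last);
    last = now;
    return elu.utilization;
  });
}

console.log('buggy:', buggySeries(samples));
console.log('fixed:', fixedSeries(samples));
```

With these numbers, the fully busy third interval comes out as 0.55 in the buggy series and as 1 in the fixed one, because the buggy subtraction mixes in time from the first interval.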

Short description of the changes

Replaced the setting of the lastValue internal variable with a call to the argument-less perf_hook.
Given that this only queries internal counters, I believe it's light enough that we can afford to call it twice per tick. The alternative would be to bypass the automatic calculation of the deltas provided as a helper and compute the ratio ourselves with a couple of arithmetic operations.
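
For comparison, the alternative mentioned above (computing the ratio by hand from two absolute readings) could look roughly like this. `ManualEluTracker` and its injected `read` function are hypothetical; in real code `read` would presumably wrap the argument-less performance.eventLoopUtilization() from perf_hooks. The readings below are scripted so the example is deterministic.

```typescript
// Only the raw counters are needed when the ratio is computed by hand.
interface EluReading {
  idle: number;
  active: number;
}

// Hypothetical tracker: keeps the last absolute reading and derives the ratio itself.
class ManualEluTracker {
  private readonly read: () => EluReading;
  private last: EluReading;

  constructor(read: () => EluReading) {
    this.read = read;
    this.last = read(); // baseline taken at construction time
  }

  // Utilization over the interval since the previous call (or since construction).
  sample(): number {
    const now = this.read();
    const idle = now.idle - this.last.idle;
    const active = now.active - this.last.active;
    this.last = now; // always store the absolute reading, never the delta
    const elapsed = idle + active;
    return elapsed > 0 ? active / elapsed : 0; // avoid 0 / 0 when no time has passed
  }
}

// Scripted cumulative readings standing in for the real counters.
const scripted: EluReading[] = [
  { idle: 0, active: 0 },
  { idle: 750, active: 250 },
  { idle: 750, active: 1250 },
];
let cursor = 0;
const tracker = new ManualEluTracker(
  () => scripted[Math.min(cursor++, scripted.length - 1)]
);

const manualFirst = tracker.sample(); // 250 ms active out of 1000 ms
const manualSecond = tracker.sample(); // 1000 ms active out of 1000 ms
console.log(manualFirst, manualSecond);
```

The trade-off is the one described above: this avoids a second call per tick at the cost of reimplementing the subtraction and the division.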

Note

As a reference, the Datadog library fixed the same bug last month, but they chose to bypass Node.js's automatic calculation of the utilization ratio and compute it themselves:
DataDog/dd-trace-js#6344
(line 259 in the new version of the file, search for "elu" if needed)

@mveroone mveroone requested a review from a team as a code owner September 27, 2025 14:25
@linux-foundation-easycla

linux-foundation-easycla bot commented Sep 27, 2025

CLA Signed

The committers listed above are authorized under a signed CLA.

  • ✅ login: david-luna / name: David Luna (0260d5d)
  • ✅ login: mveroone / name: Maxime Véroone (a5fe644)

@github-actions github-actions bot requested a review from d4nyll September 27, 2025 14:25
@mveroone mveroone force-pushed the fix/runtime_metrics/elu branch 2 times, most recently from 3ac9601 to 8d0191d Compare October 2, 2025 15:20
@mveroone
Author

mveroone commented Oct 2, 2025

@d4nyll CCLA Signed. Sorry for the delay.

Note: I'm available to discuss it if needed, ideally during European business hours, but I can arrange otherwise.

@d4nyll
Member

d4nyll commented Oct 13, 2025

Hey @mveroone, thank you for raising the issue, and apologies for taking so long to get to it. It is indeed a big bug.

In your fix, there is a stretch of code whose execution time the reported ELU metrics won't take into account:

const elu = eventLoopUtilizationCollector(this._lastValue);
// From here
observableResult.observe(elu.utilization);
this._lastValue = elu;
// To here
this._lastValue = eventLoopUtilizationCollector();

Whilst it's not such a big deal (it's only the timespan of running those two lines), it would be preferable to leave no time gaps with what's being captured.

Calling eventLoopUtilizationCollector() twice this way also calls process.hrtime() twice under the hood.

What do you think about this implementation instead?

const currentELU = eventLoopUtilizationCollector();
const deltaELU = eventLoopUtilizationCollector(currentELU, this._lastValue);
this._lastValue = currentELU;
observableResult.observe(deltaELU.utilization);

It will:

  • ensure there are no time gaps in the ELU measurements
  • only call process.hrtime() once, as the second call (i.e. eventLoopUtilizationCollector(currentELU, this._lastValue)) will only perform a subtraction.
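
The difference between the two approaches can be sketched with a fake collector whose clock advances by a fixed amount on every read. Everything here (`makeFakeCollector`, `twoReadsPerTick`, `oneReadPerTick`, and the deliberately exaggerated 10 ms step) is a hypothetical stand-in for illustration; only the three call forms (no argument, one argument, two arguments) are meant to mirror eventLoopUtilization().

```typescript
interface Elu {
  idle: number;
  active: number;
  utilization: number;
}

const toElu = (idle: number, active: number): Elu => ({
  idle,
  active,
  utilization: idle + active === 0 ? 0 : active / (idle + active),
});

type Collector = (a?: Elu, b?: Elu) => Elu;

// Fake collector: every clock read finds the loop `stepMs` busier than the previous read.
function makeFakeCollector(stepMs = 10): { collect: Collector; reads: () => number } {
  let reads = 0;
  const collect: Collector = (a, b) => {
    if (a && b) {
      // Two arguments: pure subtraction, no clock read.
      return toElu(a.idle - b.idle, a.active - b.active);
    }
    reads += 1;
    const now = toElu(0, reads * stepMs);
    return a ? toElu(now.idle - a.idle, now.active - a.active) : now;
  };
  return { collect, reads: () => reads };
}

const total = (e: Elu): number => e.idle + e.active;

// First version of the fix: measure against the baseline, then re-read the absolute value.
function twoReadsPerTick(collect: Collector, ticks: number) {
  const baseline = collect();
  let last = baseline;
  let covered = 0;
  for (let i = 0; i < ticks; i++) {
    covered += total(collect(last));
    last = collect(); // the time between these two reads is never reported
  }
  return { covered, elapsed: total(last) - total(baseline) };
}

// Suggested version: one absolute read, then a two-argument subtraction.
function oneReadPerTick(collect: Collector, ticks: number) {
  const baseline = collect();
  let last = baseline;
  let covered = 0;
  for (let i = 0; i < ticks; i++) {
    const current = collect();
    covered += total(collect(current, last));
    last = current;
  }
  return { covered, elapsed: total(last) - total(baseline) };
}

const naive = makeFakeCollector();
const naiveResult = twoReadsPerTick(naive.collect, 3);
const suggested = makeFakeCollector();
const suggestedResult = oneReadPerTick(suggested.collect, 3);
console.log({ naiveResult, naiveReads: naive.reads() });
console.log({ suggestedResult, suggestedReads: suggested.reads() });
```

In this model the two-read variant reports only part of the elapsed time and reads the clock twice per tick, while the one-read variant reports all of it with a single read per tick. In the real code the gap would only be the time it takes to run a couple of lines, so the 10 ms step overstates the effect on purpose.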

@mveroone
Author

Hey @d4nyll ,
Thanks for taking the time to review this. That's a great catch; I had completely missed it (I'm really not accustomed to development in general, and TS/JS in particular).

Your solution also has the advantage of being far more self-explanatory, and it should confuse future readers less than the previous version.

I took the liberty of committing your suggestion. I hope that's fine by you?

@d4nyll
Member

d4nyll commented Oct 21, 2025

@mveroone Hey! On my local branch I added the following test to packages/instrumentation-runtime-node/test/event_loop_utilization.test.ts to make sure we get this 100% right.

  it('should correctly calculate utilization deltas across multiple measurements', async function () {
    // This test ensures the bug where delta of deltas was observed instead of deltas of absolute values
    // does not regress. See https://github.com/open-telemetry/opentelemetry-js-contrib/pull/3118
    // This bug would surface on the third callback invocation.

    const instrumentation = new RuntimeNodeInstrumentation({});
    instrumentation.setMeterProvider(meterProvider);

    // Helper function to create blocking work that results in high utilization
    const createBlockingWork = (durationMs: number) => {
      const start = Date.now();
      while (Date.now() - start < durationMs) {
        // Busy wait to block the event loop
      }
    };

    // Helper function to collect metrics and extract utilization value
    const collectUtilization = async (): Promise<number> => {
      const { resourceMetrics } = await metricReader.collect();
      const scopeMetrics = resourceMetrics.scopeMetrics;
      const utilizationMetric = scopeMetrics[0].metrics.find(
        x => x.descriptor.name === METRIC_NODEJS_EVENTLOOP_UTILIZATION
      );

      assert.notEqual(utilizationMetric, undefined, 'metric not found');
      assert.strictEqual(utilizationMetric!.dataPoints.length, 1, 'expected one data point');

      return utilizationMetric!.dataPoints[0].value as number;
    };

    // Wait for some time to establish baseline utilization
    await new Promise(resolve => setTimeout(resolve, 200));

    // First collection
    const firstUtilization = await collectUtilization();
    assert.notStrictEqual(firstUtilization, 1, 'Expected utilization in first measurement to be not 1');

    // Second measurement: Create blocking work and measure
    createBlockingWork(50);
    const secondUtilization = await collectUtilization();
    assert.strictEqual(secondUtilization, 1, 'Expected utilization in second measurement to be 1');

    // Third measurement: Create blocking work again and measure
    // This is where the bug would manifest - if we were observing delta of deltas,
    // this measurement would not be 1
    createBlockingWork(50);
    const thirdUtilization = await collectUtilization();
    assert.strictEqual(thirdUtilization, 1, 'Expected utilization in third measurement to be 1');

    // Fourth measurement (should be the same as the third measurement, just a sanity check)
    createBlockingWork(50);
    const fourthUtilization = await collectUtilization();
    assert.strictEqual(fourthUtilization, 1, 'Expected utilization in fourth measurement to be 1');

    // Fifth measurement: Do some NON-blocking work (sanity check, should be low)
    await new Promise(resolve => setTimeout(resolve, 50));
    const fifthUtilization = await collectUtilization();
    assert.ok(fifthUtilization < 0.1, 'Expected utilization in fifth measurement to be less than 0.1');
  });

On close inspection, for my suggested code / your last commit (96ec7a4) to work on the first scrape, _lastValue can't be undefined. Otherwise, on the first scrape, const deltaELU = eventLoopUtilizationCollector(currentELU, this._lastValue); would effectively be const deltaELU = eventLoopUtilizationCollector(currentELU);, which gives the delta between the previous line (i.e. const currentELU = eventLoopUtilizationCollector();) and this one. The result is a very small deltaELU that looks something like { idle: 0, active: 0.6307079792022705, utilization: 1 }, so the first scrape will always report a utilization of 1.

So I think the last thing we need to do here is:

  1. Change the line private _lastValue?: EventLoopUtilization; to private _lastValue: EventLoopUtilization = eventLoopUtilizationCollector();. This should give these values in the test I wrote:

    firstUtilization 0.003609673196669925
    secondUtilization 1
    thirdUtilization 1
    fourthUtilization 1
    fifthUtilization 0.029487388913258965
    

    (Doing this does mean that the utilization before EventLoopUtilizationCollector is initialized (~100ms) is lost, but I think that's fine. If someone really cares about the startup utilization, they can call eventLoopUtilization() themselves at the specific point in their code where they deem startup to be finished, rather than rely on the metric being scraped.)

  2. Add the test to packages/instrumentation-runtime-node/test/event_loop_utilization.test.ts to prevent this from regressing
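
The first-scrape behaviour described above can be reproduced with a small model. `collect`, `LazyBaseline` and `EagerBaseline` are hypothetical names, and the hand-advanced `clock` (plus the assumption that every clock read happens 0.5 ms of busy time after the previous one) is there only to make the example deterministic.

```typescript
interface Elu {
  idle: number;
  active: number;
  utilization: number;
}

const toElu = (idle: number, active: number): Elu => ({
  idle,
  active,
  utilization: idle + active === 0 ? 0 : active / (idle + active),
});

// Fake cumulative counters, advanced by hand below.
const clock = { idle: 0, active: 0 };

// Mirrors the three call forms of eventLoopUtilization().
function collect(a?: Elu, b?: Elu): Elu {
  if (a && b) {
    return toElu(a.idle - b.idle, a.active - b.active); // pure subtraction
  }
  clock.active += 0.5; // assumption: each read happens 0.5 ms of busy time later
  const now = toElu(clock.idle, clock.active);
  return a ? toElu(now.idle - a.idle, now.active - a.active) : now;
}

// No baseline: on the first scrape the two-argument call degrades to the one-argument form.
class LazyBaseline {
  private last?: Elu;

  scrape(): number {
    const current = collect();
    const delta = collect(current, this.last);
    this.last = current;
    return delta.utilization;
  }
}

// Baseline captured when the object is created, as proposed in step 1.
class EagerBaseline {
  private last: Elu = collect();

  scrape(): number {
    const current = collect();
    const delta = collect(current, this.last);
    this.last = current;
    return delta.utilization;
  }
}

const lazy = new LazyBaseline();
const eager = new EagerBaseline();

// Simulate roughly one second that is almost entirely idle.
clock.idle += 990;
clock.active += 9;

const lazyFirst = lazy.scrape();
const eagerFirst = eager.scrape();
console.log({ lazyFirst, eagerFirst });
```

In this model the lazy variant reports 1 on its first scrape even though the loop was almost idle, while the eager variant reports a value close to 0.01.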


I appreciate this PR is turning into a bigger one than you first imagined, but thanks for working with me to get this bug squashed!

@mveroone
Author

mveroone commented Oct 24, 2025

Thanks a lot @d4nyll
This is awesome.

I appreciate this PR is turning into a bigger one than you first imagined, but thanks for working with me to get this bug squashed!

No problem, I too would rather we arrive at the right, robust, future-proof solution than a quick and simple one.

Is it fine if I commit both of your suggestions myself? I'm not accustomed to this community's traditions, and I definitely wouldn't want to rob you of attribution for your work.
In any case, I'll try it on my side, if only to learn how testing works here (being a sysadmin by trade, these are things I'm learning late), but I will await your response before pushing it to this branch.

EDIT: as per our private conversation, you're welcome to send a PR with the above suggestions against this branch, and I'll gladly review it to the best of my ability.

Member

@d4nyll d4nyll left a comment

Hey @mveroone I think it looks good. Just need to update the branch with the main branch, fix any conflicts, run the tests again (just to make sure) and we should be good 🙏

@mveroone mveroone force-pushed the fix/runtime_metrics/elu branch from 95cff5d to 8398e8f Compare October 27, 2025 19:30
Member

@d4nyll d4nyll left a comment

LGTM!

@d4nyll d4nyll added the has:owner-approval Approved by Component Owner label Oct 27, 2025
Member

@raphael-theriault-swi raphael-theriault-swi left a comment

Thank you for working on this !

@david-luna david-luna changed the title fix: use absolute results in eventLoopUtilization computation fix(instrumentation-runtime-node) : use absolute results in eventLoopUtilization computation Oct 29, 2025
@david-luna david-luna changed the title fix(instrumentation-runtime-node) : use absolute results in eventLoopUtilization computation fix(instrumentation-runtime-node): use absolute results in eventLoopUtilization computation Oct 29, 2025
@d4nyll
Member

d4nyll commented Oct 29, 2025

I see the unit test failing for Node.js v18 with AssertionError [ERR_ASSERTION]: Expected utilization in fifth measurement to be less than 0.1. I'll look into it now.

@david-luna
Contributor

@d4nyll @mveroone

Tests are failing for nodejs v18. Could you have a look?

@mveroone
Author

I see the unit test failing for Node.js v18 with AssertionError [ERR_ASSERTION]: Expected utilization in fifth measurement to be less than 0.1. I'll look into it now.

I have been trying to reproduce this but haven't managed to, either by emulating the GHA tests with act or by running them locally against 18.0, 19.19 or 18.20.

Could it depend on the parallelization of tests by nx? Is there a guarantee that the event loop is dedicated to one test at a time while running? Otherwise it might get flaky depending on the test runtime environment.

@d4nyll
Member

d4nyll commented Oct 31, 2025

Otherwise it might get flaky depending on test runtime environment.

With a5fe644 it would be very unlikely to be flaky, as the event loop would need to be completely busy for the entire 50ms we are waiting for it.

@mveroone can you update the branch with the base branch and @david-luna would you be able to run the tests again afterwards?


4 participants